Load required packages

To load the required packages using the library() function, run chunk 1 by clicking on the “Run Current Chunk” button on the right. This will load the following packages.

library(XVector)
library(Seurat)
library(tidyverse)
library(Matrix)
library(RCurl)
library(scales)
library(sctransform)

Note: If you have not installed the packages yet, then install them first before loading

Load individual count matrices (cellranger output)

Read in the data by running chunk 2. The Read10X() function reads in the output of the cellranger pipeline from 10X, returning a unique molecular identified (UMI) count matrix. The values in this matrix represent the number of molecules for each feature (i.e. gene; row) that are detected in each cell (column).

data1 <- Read10X("PDAC_pbmc_1_filtered_feature_bc_matrix")
data2 <- Read10X("PDAC_pbmc_2_filtered_feature_bc_matrix")
data3 <- Read10X("PDAC_pbmc_3_filtered_feature_bc_matrix")
data4 <- Read10X("PDAC_pbmc_4_filtered_feature_bc_matrix")

#Let's take a look at how the data looks like
head(data1)
6 x 2661 sparse Matrix of class "dgCMatrix"
  [[ suppressing 56 column names ‘AAACCCAGTCTAATCG-1’, ‘AAACGAAAGTATAACG-1’, ‘AAACGAAGTCATAGTC-1’ ... ]]
                                                                                                                          
MIR1302-10   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
FAM138A      . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
OR4F5        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
RP11-34P13.7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
RP11-34P13.8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
AL627309.1   . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . .
                     
MIR1302-10   . ......
FAM138A      . ......
OR4F5        . ......
RP11-34P13.7 . ......
RP11-34P13.8 . ......
AL627309.1   . ......

 .....suppressing 2605 columns in show(); maybe adjust 'options(max.print= *, width = *)'
 ..............................

Create Seurat Objects

Let’s use the individual count matrices to create separate Seurat objects by running chunk 3. The seurat object serves as a container for both the data (like the count matrix) and analysis (e.g. PCA, metadata) for a single-cell dataset.

data_seurat1 <- CreateSeuratObject(counts = data1, project = "Human-1", min.cells = 3, min.features = 200)
data_seurat2 <- CreateSeuratObject(counts = data2, project = "Human-2", min.cells = 3, min.features = 200)
data_seurat3 <- CreateSeuratObject(counts = data3, project = "Human-3", min.cells = 3, min.features = 200)
data_seurat4 <- CreateSeuratObject(counts = data4, project = "Human-4", min.cells = 3, min.features = 200)

(Optional) Cell Cycle Scoring

In some cases, there is a need for mitigating the effects of cell cycle heterogeneity in scRNA-seq data.This can be done by calculating cell cycle phase scores based on known cell cycle markers , and regressing these out of the data during pre-processing.

To perform cell cycle scoring, run chunk 4. In this chunk, we are first Log Normalizing individual seurat objects using the NormalizeData() function. Then, we are using the CellCycleScoring() function to assign each cell a cell cycle score, based on its expression of G2/M and S phase markers. Seurat stores the s.genes and g2m.genes in the “cc.genes.updated.2019” list. Note: If you receive any warning, read it carefully. You can ignore some warnings, while take action upon receiving some.

#segregate the "cc.genes.updated.2019" list into markers of G2/M phase and markers of S phase
s.genes <- cc.genes.updated.2019$s.genes
g2m.genes <- cc.genes.updated.2019$g2m.genes

#Prior to running "CellCycleScoring" command, each seurat object needs to be Lognormalized using "NormalizeData" function
data_norm1 <- NormalizeData(data_seurat1, normalization.method = "LogNormalize", scale.factor = 10000, verbose = FALSE)
data_norm2 <- NormalizeData(data_seurat2, normalization.method = "LogNormalize", scale.factor = 10000, verbose = FALSE)
data_norm3 <- NormalizeData(data_seurat3, normalization.method = "LogNormalize", scale.factor = 10000, verbose = FALSE)
data_norm4 <- NormalizeData(data_seurat4, normalization.method = "LogNormalize", scale.factor = 10000, verbose = FALSE)

#Now perform CellCycleScoring for each seurat objects
data_norm1 <- CellCycleScoring(data_norm1, s.features = s.genes, g2m.features = g2m.genes, set.ident = FALSE, verbose = FALSE)
Warning: The following features are not present in the object: UHRF1, CASP8AP2, not searching for symbol synonymsWarning: The following features are not present in the object: PIMREG, JPT1, GAS2L3, not searching for symbol synonyms
data_norm2 <- CellCycleScoring(data_norm2, s.features = s.genes, g2m.features = g2m.genes, set.ident = FALSE, verbose = FALSE)
Warning: The following features are not present in the object: UHRF1, EXO1, CASP8AP2, not searching for symbol synonymsWarning: The following features are not present in the object: PIMREG, JPT1, not searching for symbol synonyms
data_norm3 <- CellCycleScoring(data_norm3, s.features = s.genes, g2m.features = g2m.genes, set.ident = FALSE, verbose = FALSE)
Warning: The following features are not present in the object: UHRF1, CASP8AP2, not searching for symbol synonymsWarning: The following features are not present in the object: PIMREG, JPT1, not searching for symbol synonyms
data_norm4 <- CellCycleScoring(data_norm4, s.features = s.genes, g2m.features = g2m.genes, set.ident = FALSE, verbose = FALSE)
Warning: The following features are not present in the object: UHRF1, CASP8AP2, not searching for symbol synonymsWarning: The following features are not present in the object: PIMREG, JPT1, not searching for symbol synonyms
# view cell cycle scores and phase assignments
head(data_norm1)

Merge individual seurat objects into one

Run chunk 5 to merge all four seurat objects into one. The merge() function merges the raw count matrices of two or more Seurat objects creating a new Seurat object with a combined raw count matrix. Then, let’s take a look at the metadata of the merged seurat object using the View() function.

#NOTE: By default, merge() function combines Seurat objects based on the raw count matrices, erasing any previous normalization
data_merged <- merge(data_norm1, y = c(data_norm2, data_norm3, data_norm4), add.cell.ids = c("H1", "H2", "H3","H4"), project = "Human_1234")

# Make sure that the merge was successful
table(data_merged$orig.ident)

Human-1 Human-2 Human-3 Human-4 
   2405    3378    3434    5472 

Calculate additional quality control metrics

Run chunk 6 to calculate the mitochondrial and ribosomal transcript percentage per cell. Seurat has a function that enables us to do this. The PercentageFeatureSet() function can take a specific pattern and search through the dataset for that pattern. We can search for mitochondrial genes by looking for the pattern “MT-”. Similarly, for the ribosomal genes, we can look for the pattern “^RP[SL]”. Usually, cells with high proportions of mitochondrial genes are considered as poor-quality cells. On the other hand, percentage of ribosomal transcript per cell varies greatly from cell type to cell type. Therefore, caution should be taken to use percent.RIBO values to filter out low quality cells.

#The [[ operator can add columns to object metadata. This is a great place to stash QC stats
#First add column with mitochondiral gene expression
data_merged[["percent.MT"]] <- PercentageFeatureSet(data_merged, pattern = "^MT-")
#Add column with ribosomal gene expression
data_merged[["percent.RIBO"]] <- PercentageFeatureSet(data_merged, pattern = "^RP[SL]")
#NOTE: this calculation is performed per cell. That is why this step can be performed on merged data
#Now let's make sure that all the qc metrics are present in the metadata by using the head() function:
head (data_merged)

Visualize the common QC metrics

Run chunk 7 to plot the common QC metrics.

VlnPlot(data_merged, features = c("nFeature_RNA", "nCount_RNA"), ncol = 2)

VlnPlot(data_merged, features = c("percent.MT","percent.RIBO"), ncol = 2)

FeatureScatter(data_merged, feature1 = "percent.RIBO", feature2 = "percent.MT")

FeatureScatter(data_merged, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")

Perform filtering

Now that we have visualized the various metrics, we can decide on the thresholds to apply which will result in the removal of low quality cells. Often the recommendations mentioned earlier are a rough guideline, and the specific experiment needs to inform the exact thresholds chosen. We will use the following thresholds:

Run chunk 8 to filter the merged dataset based on the parameters specified above. Here we are using the use the subset() function.

data_filtered <- subset(data_merged, subset = nFeature_RNA > 1000 & nCount_RNA < 35000 & percent.MT < 15 & percent.RIBO > 5)
VlnPlot(data_filtered, features = c("nFeature_RNA", "nCount_RNA"), ncol = 2)

VlnPlot(data_filtered, features = c("percent.MT","percent.RIBO"), ncol = 2)

FeatureScatter(data_filtered, feature1 = "percent.RIBO", feature2 = "percent.MT")

FeatureScatter(data_filtered, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")

After performing the filtering, it’s recommended to look back over the metrics to make sure that your data matches your expectations and is good for downstream analysis. If not satisfied, you can re-run chunk 8 again and again with a different set of filtering parameters.

Save the filtered seurat object

Based on these QC metrics we would identify any failed samples and move forward with our filtered cells. Often we iterate through the QC metrics using different filtering criteria; it is not necessarily a linear process. When satisfied with the filtering criteria, we would save our filtered cell object for clustering and marker identification. Please run chunk 9 to save the filtered cells as a .rds file:

saveRDS(data_filtered, file = "data_filtered.rds")

——End——

---
title: "R Notebook - 1_QC_filtering_and_first_pass_clustering"
output: html_notebook
---
#### Some useful links
https://satijalab.org/seurat/articles/pbmc3k_tutorial.html
https://satijalab.org/seurat/articles/merge_vignette.html
https://satijalab.org/seurat/articles/integration_introduction.html
https://satijalab.org/seurat/articles/sctransform_vignette.html


#### Load required packages
To load the required packages using the library() function, run chunk 1 by clicking on the "Run Current Chunk" button on the right. This will load the following packages.

```{r chunk 1}
library(XVector)
library(Seurat)
library(tidyverse)
library(Matrix)
library(RCurl)
library(scales)
library(sctransform)
```
Note: If you have not installed the packages yet, then install them first before loading

#### Load individual count matrices (cellranger output)
Read in the data by running chunk 2. The Read10X() function reads in the output of the cellranger pipeline from 10X, returning a unique molecular identified (UMI) count matrix. The values in this matrix represent the number of molecules for each feature (i.e. gene; row) that are detected in each cell (column).

```{r chunk 2}
data1 <- Read10X("PDAC_pbmc_1_filtered_feature_bc_matrix")
data2 <- Read10X("PDAC_pbmc_2_filtered_feature_bc_matrix")
data3 <- Read10X("PDAC_pbmc_3_filtered_feature_bc_matrix")
data4 <- Read10X("PDAC_pbmc_4_filtered_feature_bc_matrix")

#Let's take a look at how the data looks like
head(data1)
```


#### Create Seurat Objects
Let's use the individual count matrices to create separate Seurat objects by running chunk 3. The seurat object serves as a container for both the data (like the count matrix) and analysis (e.g. PCA, metadata) for a single-cell dataset.

```{r chunk 3}
data_seurat1 <- CreateSeuratObject(counts = data1, project = "Human-1", min.cells = 3, min.features = 200)
data_seurat2 <- CreateSeuratObject(counts = data2, project = "Human-2", min.cells = 3, min.features = 200)
data_seurat3 <- CreateSeuratObject(counts = data3, project = "Human-3", min.cells = 3, min.features = 200)
data_seurat4 <- CreateSeuratObject(counts = data4, project = "Human-4", min.cells = 3, min.features = 200)
```

#### (Optional) Cell Cycle Scoring
In some cases, there is a need for mitigating the effects of cell cycle heterogeneity in scRNA-seq data.This can be done by calculating cell cycle phase scores based on known cell cycle markers , and regressing these out of the data during pre-processing.

To perform cell cycle scoring, run chunk 4. In this chunk, we are first Log Normalizing individual seurat objects using the NormalizeData() function. Then, we are using the CellCycleScoring() function to assign each cell a cell cycle score, based on its expression of G2/M and S phase markers. Seurat stores the s.genes and g2m.genes in the "cc.genes.updated.2019" list.
Note:  If you receive any warning, read it carefully. You can ignore some warnings, while take action upon receiving some.

```{r chunk 4}
#segregate the "cc.genes.updated.2019" list into markers of G2/M phase and markers of S phase
s.genes <- cc.genes.updated.2019$s.genes
g2m.genes <- cc.genes.updated.2019$g2m.genes

#Prior to running "CellCycleScoring" command, each seurat object needs to be Lognormalized using "NormalizeData" function
data_norm1 <- NormalizeData(data_seurat1, normalization.method = "LogNormalize", scale.factor = 10000, verbose = FALSE)
data_norm2 <- NormalizeData(data_seurat2, normalization.method = "LogNormalize", scale.factor = 10000, verbose = FALSE)
data_norm3 <- NormalizeData(data_seurat3, normalization.method = "LogNormalize", scale.factor = 10000, verbose = FALSE)
data_norm4 <- NormalizeData(data_seurat4, normalization.method = "LogNormalize", scale.factor = 10000, verbose = FALSE)

#Now perform CellCycleScoring for each seurat objects
data_norm1 <- CellCycleScoring(data_norm1, s.features = s.genes, g2m.features = g2m.genes, set.ident = FALSE, verbose = FALSE)
data_norm2 <- CellCycleScoring(data_norm2, s.features = s.genes, g2m.features = g2m.genes, set.ident = FALSE, verbose = FALSE)
data_norm3 <- CellCycleScoring(data_norm3, s.features = s.genes, g2m.features = g2m.genes, set.ident = FALSE, verbose = FALSE)
data_norm4 <- CellCycleScoring(data_norm4, s.features = s.genes, g2m.features = g2m.genes, set.ident = FALSE, verbose = FALSE)

# view cell cycle scores and phase assignments
head(data_norm1)
```


#### Merge individual seurat objects into one
Run chunk 5 to merge all four seurat objects into one. The merge() function merges the raw count matrices of two or more Seurat objects creating a new Seurat object with a combined raw count matrix. Then, let's take a look at the metadata of the merged seurat object using the View() function.

```{r chunk 5}
#NOTE: By default, merge() function combines Seurat objects based on the raw count matrices, erasing any previous normalization
data_merged <- merge(data_norm1, y = c(data_norm2, data_norm3, data_norm4), add.cell.ids = c("H1", "H2", "H3","H4"), project = "Human_1234")

# Make sure that the merge was successful
table(data_merged$orig.ident)
```


#### Calculate additional quality control metrics
Run chunk 6 to calculate the mitochondrial and ribosomal transcript percentage per cell. Seurat has a function that enables us to do this. The PercentageFeatureSet() function can take a specific pattern and search through the dataset for that pattern. We can search for mitochondrial genes by looking for the pattern "MT-". Similarly, for the ribosomal genes, we can look for the pattern "^RP[SL]". Usually, cells with high proportions of mitochondrial genes are considered as poor-quality cells. On the other hand, percentage of ribosomal transcript per cell varies greatly from cell type to cell type. Therefore, caution should be taken to use percent.RIBO values to filter out low quality cells.

```{r chunk 6}
#The [[ operator can add columns to object metadata. This is a great place to stash QC stats
#First add column with mitochondiral gene expression
data_merged[["percent.MT"]] <- PercentageFeatureSet(data_merged, pattern = "^MT-")
#Add column with ribosomal gene expression
data_merged[["percent.RIBO"]] <- PercentageFeatureSet(data_merged, pattern = "^RP[SL]")
#NOTE: this calculation is performed per cell. That is why this step can be performed on merged data
#Now let's make sure that all the qc metrics are present in the metadata by using the head() function:
head (data_merged)
```


#### Visualize the common  QC metrics
Run chunk 7 to plot the common QC metrics. 

```{r chunk 7}
VlnPlot(data_merged, features = c("nFeature_RNA", "nCount_RNA"), ncol = 2)
VlnPlot(data_merged, features = c("percent.MT","percent.RIBO"), ncol = 2)
FeatureScatter(data_merged, feature1 = "percent.RIBO", feature2 = "percent.MT")
FeatureScatter(data_merged, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
```


#### Perform filtering
Now that we have visualized the various metrics, we can decide on the thresholds to apply which will result in the removal of low quality cells. Often the recommendations mentioned earlier are a rough guideline, and the specific experiment needs to inform the exact thresholds chosen. We will use the following thresholds:

- nFeature_RNA > 500
- nCount_RNA < 700
- percent.MT < 25
- percent.RIBO > 3

Run **chunk 8** to filter the merged dataset based on the parameters specified above. Here we are using the use the subset() function.

```{r chunk 8}
data_filtered <- subset(data_merged, subset = nFeature_RNA > 1000 & nCount_RNA < 35000 & percent.MT < 15 & percent.RIBO > 5)
VlnPlot(data_filtered, features = c("nFeature_RNA", "nCount_RNA"), ncol = 2)
VlnPlot(data_filtered, features = c("percent.MT","percent.RIBO"), ncol = 2)
FeatureScatter(data_filtered, feature1 = "percent.RIBO", feature2 = "percent.MT")
FeatureScatter(data_filtered, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
```

After performing the filtering, it’s recommended to look back over the metrics to make sure that your data matches your expectations and is good for downstream analysis. If not satisfied, you can re-run **chunk 8** again and again with a different set of filtering parameters.


#### Save the filtered seurat object
Based on these QC metrics we would identify any failed samples and move forward with our filtered cells. Often we iterate through the QC metrics using different filtering criteria; it is not necessarily a linear process. When satisfied with the filtering criteria, we would save our filtered cell object for clustering and marker identification. Please run **chunk 9** to save the filtered cells as a .rds file:

```{r chunk 9}
saveRDS(data_filtered, file = "data_filtered.rds")
```


------End------